Method and tensor traversal engine for strided memory access during execution of neural networks

ABSTRACT

A tensor traversal engine in a processor system comprising a source memory component and a destination memory component, the tensor traversal engine comprising: a control signal register storing a control signal for a strided data transfer operation from the source memory component to the destination memory component, the control signal comprising an initial source address, an initial destination address, a first source stride length in a first dimension, and a first source stride count in the first dimension; a source address register communicatively coupled to the control signal register; a destination address register communicatively coupled to the control signal register; a first source stride counter communicatively coupled to the control signal register; and control logic communicatively coupled to the control signal register, the source address register, and the first source stride counter.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 17/331,590, filed on 26 May 2021, which claims the benefit of U.S. Provisional Application No. 63/030,183, filed on 26 May 2020, each of which is incorporated in its entirety by this reference.

This application is related to U.S. Pat. No. 10,474,464, filed on 3 Jul. 2018, and U.S. patent application Ser. No. 17/127,904, filed on 18 Dec. 2020, each of which is incorporated in its entirety by this reference.

TECHNICAL FIELD

This invention relates generally to the field of integrated circuit design and more specifically to a new and useful system for direct memory access of input tensors in the field of integrated circuit design.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a flowchart representation of a first method;

FIG. 2 is a flowchart representation of a second method;

FIG. 3 is a schematic representation of a system;

FIG. 4 is a flowchart representation of a third method;

FIG. 5 is a flowchart representation of one variation of the third method;

FIGS. 6A and 6B are schematic representations of variations of the tensor traversal engine;

FIGS. 7A, 7B, 7C, and 7D are conceptual representations of access patterns for the first method and the second method; and

FIG. 8 is a schematic representation of one component of the tensor traversal engine.

DESCRIPTION OF THE EMBODIMENTS

The following description of embodiments of the invention is not intended to limit the invention to these embodiments but rather to enable a person skilled in the art to make and use this invention. Variations, configurations, implementations, example implementations, and examples described herein are optional and are not exclusive to the variations, configurations, implementations, example implementations, and examples they describe. The invention described herein can include any and all permutations of these variations, configurations, implementations, example implementations, and examples.

1. Method

As shown in FIG. 1, one variation of the method S100 includes: accessing a control signal defining an initial source address, an initial destination address, a source block count, a first stride count, a first stride length, and a first stride dimension in Block S110; writing the initial source address to a source address register in Block S120; writing the source block count to a source block counter in Block S122; writing the first stride count to a first stride counter in Block S124; and writing the initial destination address to a destination address register in Block S126. The method S100 also includes, while a value of the first stride counter is greater than zero and while a value of the source block counter is greater than zero: reading a current source address from the source address register in Block S130; reading a current destination address from the destination address register in Block S132; transferring a data word stored at the current source address in the source memory component to the current destination address in the destination memory component in Block S140; incrementing the current source address in the source address register in Block S150; incrementing the current destination address in the destination address register in Block S152; and decrementing the value of the source block counter in Block S154. The method S100 further includes, while a value of the first stride counter is greater than zero and in response to the value of the source block counter equaling zero: advancing the current source address in the source address register based on the first stride length and the first stride dimension in Block S160; decrementing the value of the first stride counter in Block S170; and rewriting the source block count to the source block counter in Block S172.
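
The loop structure of the method S100 can be sketched in software as follows. This is an illustrative model only, not the hardware implementation: the function name `strided_copy` is hypothetical, the memory components are modeled as flat Python lists of data words, and the first stride length is assumed to be a word offset from the start of one contiguous block to the start of the next (standing in for the combination of stride length and stride dimension recited above).

```python
def strided_copy(src, dst, src_addr, dst_addr,
                 block_count, stride_count, stride_length):
    """Software model of method S100 over flat, word-addressed lists.

    Assumption: stride_length is the distance, in words, between the
    first word of one contiguous block and the first word of the next.
    """
    block_counter = block_count      # Block S122
    stride_counter = stride_count    # Block S124
    while stride_counter > 0:
        while block_counter > 0:
            dst[dst_addr] = src[src_addr]   # Block S140: move one data word
            src_addr += 1                   # Block S150
            dst_addr += 1                   # Block S152
            block_counter -= 1              # Block S154
        # Block S160: skip from the end of this block to the next block start
        src_addr += stride_length - block_count
        stride_counter -= 1                 # Block S170
        block_counter = block_count         # Block S172
    return dst


source = list(range(16))
destination = strided_copy(source, [None] * 6, src_addr=1, dst_addr=0,
                           block_count=2, stride_count=3, stride_length=4)
```

Under these assumptions, three two-word blocks starting at source addresses 1, 5, and 9 are packed contiguously into the destination by a single call, which mirrors the single control signal described above.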

As shown in FIG. 2, the method S200 for executing a strided data transfer operation from a source memory component to a destination memory component includes: writing, to a control signal register, a control signal representing a source access pattern in the source memory component defining a first dimension and including a set of source data blocks in Block S210. The control signal includes an initial source address, an initial destination address, a first source stride length in the first dimension, and a first source stride count in the first dimension. The method S200 also includes: writing the initial source address to a source address register; writing the first source stride count to a first source stride counter; and writing the initial destination address to a destination address register in Block S220. The method S200 additionally includes transferring an initial source data block stored at the initial source address to the initial destination address in Block S230. The method S200 further includes, in response to a first current source stride count in the first source stride counter representing at least one remaining source data block in the first dimension of the source access pattern: reading a current source address from the source address register and reading a current destination address from the destination address register in Block S240; and transferring a target source data block stored at the current source address to the current destination address in Block S250. The method S200 further includes, in response to completing transfer of the target source data block: advancing the source address register based on the first source stride length, the first dimension, and the current source address in Block S260; advancing the destination address register in Block S270; and decrementing the first current source stride count in the first source stride counter in Block S280.

2. Tensor Traversal Engine

As shown in FIG. 3, a tensor traversal engine 100 is arranged in a processor system 200 comprising a source memory component 210 and a destination memory component 220, the tensor traversal engine 100 including: a control signal register 110; a source address register 120; a destination address register 130; a first source stride counter 142; and control logic 160. The control signal register 110 is configured to store a control signal for a strided data transfer operation from the source memory component 210 to the destination memory component 220. The control signal: represents a source access pattern in the source memory component 210 defining a first dimension and including a set of source data blocks; and includes an initial source address, an initial destination address, a first source stride length in the first dimension, and a first source stride count in the first dimension. The source address register 120 is communicatively coupled to the control signal register 110 and configured to store a current source address. The destination address register 130 is communicatively coupled to the control signal register 110 and configured to store a current destination address. The first source stride counter 142 is communicatively coupled to the control signal register 110 and configured to store a first current source stride count in the first dimension. The control logic 160 is configured to execute the strided data transfer operation by: writing the initial source address to the source address register 120; writing the first source stride count to the first source stride counter 142; and writing the initial destination address to the destination address register 130.
Additionally, the control logic 160 can execute the strided data transfer operation by, in response to a first current source stride count in the first source stride counter 142 representing at least one remaining source data block in the first dimension of the source access pattern: reading a current source address from the source address register 120; reading a current destination address from the destination address register 130; transferring the source data block stored at the current source address to the current destination address; advancing the source address register 120 based on the first source stride length, the first dimension, and the current source address; advancing the destination address register 130; and decrementing the first current source stride count in the first source stride counter 142.

3. Applications

Generally, the methods S100 and S200 are executed by a tensor traversal engine (hereinafter “TTE”) arranged within a processor system 200 to transfer a set of non-contiguous data blocks from a source memory component 210, according to a particular source access pattern (e.g., a one- or multi-dimensional strided access pattern), to a destination memory component 220 based on a single control signal and in order to selectively access non-contiguous data blocks from arrays, matrices, and/or tensors without requiring multiple control signals and memory access cycles of the TTE. More specifically, the TTE 100 is configured to: receive a control signal defining a source address, a destination address, and a source access pattern that specifies a source block count, a set of source stride counts, a set of source stride lengths, and a set of corresponding source surface dimensions; write the source address to a source address register 120; write the source block count to a source block counter 122; write the set of source stride counts to a corresponding set of source stride counters 140; and transfer data from the source memory component 210 to the destination memory component 220 by advancing the source address according to the source access pattern (e.g., the source stride lengths and corresponding source dimensions) and repeatedly decrementing and resetting the value of the source block counter 122 and the values of the set of source stride counters 140 in coordination with the advancing source address.

Thus, the TTE 100 can transfer strided, non-contiguous data (such as from multiple locations of a receptive field within an input tensor during execution of a convolution operation) based on a single control signal by replacing the series of control signals necessary for a standard direct memory access engine to access a set of strided data (e.g., multiple distinct control signals, each specifying a source memory address corresponding to each contiguous data block) with a single control signal cooperating with a larger number of counters and registers that track the TTE's progression through the source access pattern. As a result, the TTE 100 is characterized by vastly improved transfer speeds for strided, non-contiguous data blocks between memory components within a processor system 200 at the expense of greater control signal complexity and a larger spatial footprint in the processor system 200 when compared to direct memory access engines.

In addition to accessing memory from a source memory component 210 according to a particular source access pattern, as described above, the TTE 100 can also receive a control signal specifying a particular destination storage pattern and transfer the accessed data blocks from the source memory component 210 into the destination memory component 220 according to this destination storage pattern. Therefore, the TTE 100 is configured to receive a control signal defining a destination storage pattern that specifies: a destination block count, a set of destination stride counts, a set of destination stride lengths, and a set of corresponding destination dimensions. The TTE 100 is further configured to: write the destination address to a destination address register 130; write the destination block count to a destination block counter; write the set of destination stride counts to a corresponding set of destination stride counters 150; and store data transferred from the source memory component 210 in the destination memory component 220 by advancing the destination address according to the destination storage pattern (e.g., the destination stride lengths and corresponding destination dimensions) and by repeatedly decrementing and resetting the value of the destination block counter and the values of the set of destination stride counters 150 in coordination with the advancing destination address.

Thus, in addition to accessing strided, non-contiguous data blocks from the source memory component 210 and storing these data blocks within the destination memory component 220 in a linear data format, the TTE 100 can also reformat these accessed data blocks into a different strided, multi-dimensional output format, thereby reducing additional processing cycles typically utilized to reformat data for particular tensor operations.

Additionally, the TTE 100 can include hardware-implemented components configured to: pad the data accessed from the source memory component 210 during transfer to the destination memory component 220; change the bit length of data (e.g., compress or expand) accessed from the source memory component 210 during transfer to the destination memory component 220; transpose data accessed from the source memory component 210 during transfer to the destination memory component 220; and compress or decompress encoded data accessed from the source memory component 210 during transfer to the destination memory component 220. Furthermore, the TTE 100 can broadcast data accessed from the source memory component 210 to multiple destination memory components 220.

3.1 Example: Convolutional Neural Networks

In one application of the TTE, a processor configured to execute convolutional neural network (hereinafter “CNN”) based inference algorithms includes multiple instances of the TTE. In this application, the processor system 200 can receive a statically scheduled sequence of instructions to frequently transfer large four-dimensional tensors (representing inputs, weights, and/or outputs generated in a CNN inference) between memory components within the processor system 200. A static scheduler (further described in U.S. patent application Ser. No. 17/127,904, which is incorporated by reference) can generate a static schedule that defines multiple partitions, or chunks, of these four-dimensional tensors that the processor system 200 then transfers between memory components within the processor system 200. The TTE 100 is configured, in hardware, to efficiently (in terms of power usage and speed) transfer these partitions within the processor system 200. Thus, the TTE 100 can access data according to various strided access patterns, further described below, that are commonly represented amongst these partitions of four-dimensional tensors (e.g., a 32-by-64-by-3 chunk from a 224-by-224-by-3-by-1 tensor). Additionally, the TTE 100 is configured to execute additional operations inline, to reduce the load on the processor cores of the processor system 200 during execution of a CNN inference algorithm. For example, as the TTE 100 transfers data between memory components of the processor system 200, the TTE 100 can execute operations such as data compression, data padding, bit expansion, and data transposing.
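
As a worked illustration of the chunk example above, the following sketch derives one possible set of control signal fields for a 32-by-64-by-3 chunk of a 224-by-224-by-3 tensor. The function name `chunk_access_pattern`, the field names, the row-major layout with the channel dimension innermost, and the one-word-per-element encoding are all assumptions made for illustration; the source does not specify a memory layout.

```python
def chunk_access_pattern(tensor_shape, chunk_shape, origin):
    """Derive a one-dimensional strided pattern for a chunk of a tensor.

    Assumptions: the tensor is stored row-major as (rows, cols, channels)
    with one data word per element, and the chunk spans every channel.
    """
    rows, cols, channels = tensor_shape
    chunk_rows, chunk_cols, chunk_channels = chunk_shape
    row0, col0 = origin
    if chunk_channels != channels:
        raise ValueError("this sketch only handles full-channel chunks")
    row_pitch = cols * channels  # words between the starts of adjacent rows
    return {
        "initial_source_address": row0 * row_pitch + col0 * channels,
        "source_block_count": chunk_cols * channels,  # contiguous words per row
        "first_stride_length": row_pitch,
        "first_stride_count": chunk_rows,
    }


pattern = chunk_access_pattern((224, 224, 3), (32, 64, 3), origin=(0, 0))
```

Under these assumptions the chunk reduces to 32 contiguous blocks of 192 words each, with block starts spaced 672 words apart, so one control signal can describe the whole partition.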

4. Terminology

Generally, the TTE 100 is described herein as executing certain steps “in response to” particular conditions. In addition to describing an if-then logical relationship between the condition and the following steps, the phrase “in response to” as utilized herein can also describe looping or persistent conditional logic (e.g., a while loop). For example, the TTE 100 can continue to execute steps recited under the “in response to” phrase until the condition of the “in response to” phrase is no longer true.

Generally, the TTE 100 is described herein as “advancing” source addresses and/or destination addresses in the source address register 120 and/or the destination address register 130, respectively. As utilized herein, advancing a memory address is distinct from incrementing a memory address in that advancement can occur either forward (positive) or backward (negative) within the address space. Additionally, as utilized herein, advancing a memory address can indicate an increase or decrease of the memory address by multiple increments or steps (e.g., by skipping over intervening addresses within the address space). Likewise, phrases such as “progressing” or “stepping” may be utilized synonymously herein to indicate advancement of a memory address in a register to a different address based on the value of the prior address.
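
The distinction between incrementing and advancing can be illustrated with a short sketch. The helper name `visited_addresses` is hypothetical, and the signed integer stride is an assumed encoding of forward and backward advancement.

```python
def visited_addresses(start, stride_length, stride_count):
    """List the addresses reached by repeatedly advancing from `start`.

    Assumption: a signed stride_length encodes direction, so a negative
    value advances backward through the address space.
    """
    addresses = []
    address = start
    for _ in range(stride_count):
        addresses.append(address)
        address += stride_length  # advance: may skip addresses or move backward
    return addresses


forward = visited_addresses(start=0, stride_length=8, stride_count=3)
backward = visited_addresses(start=100, stride_length=-8, stride_count=3)
```

Here `forward` skips seven intervening addresses on each step, while `backward` walks toward lower addresses; incrementing is the special case of a stride length of one.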

5. TTE Description

Generally, as shown in FIG. 3, the TTE 100 defines a component arranged in a processor system 200 (i.e., processor circuit), which can include multiple memory components (in a memory hierarchy), such as main memory (i.e., primary memory), shared caches (i.e., L2 memory), and individual caches (i.e., L1 memory) for each processing unit in the processor system 200. The TTE 100 includes: an address and/or control signal buffer 112, a control signal register 110, a data buffer 170, control logic 160, a source address register 120, a source block counter 122, a set of source stride counters 140, a destination address register 130, a destination block counter 132, and/or a set of destination stride counters 150. In some implementations, the TTE 100 can additionally include: a transpose buffer 172, decompression logic (e.g., a Huffman decoder), and/or bit expansion and compression logic.

Generally, the TTE 100 can include a data buffer 170 configured to store data accessed from the source memory component 210 prior to transfer to the destination memory component 220. Thus, the data buffer 170 enables the TTE 100 to asynchronously transfer data from the source memory component 210 to the destination memory component 220.

The processor system 200 can include multiple instances of the TTE 100, each of which is arranged between two memory components in the processor system 200 and is configured to transfer data between these two memory components instead of transferring data between any two memory components in the processor system 200 via the system interconnect. In one implementation, the processor system 200 includes instances of the TTE 100 arranged between main memory and L2 memory and instances of the TTE 100 arranged between L2 memory and L1 memory.

However, the TTE 100 can include fewer components than, or components in addition to, those described above, as necessary, to interface with the particular processor system 200 of which the TTE 100 is a component.

The TTE 100 includes a number of “registers” and “counters.” Generally, each “register” includes an array of flip flops, latches, or RAM instances configured to store a value during execution of a data transfer operation. “Registers” include “counters,” which specifically store numerical values utilized for tracking the TTE's progression through a source access pattern or a destination storage pattern during a data transfer operation.

5.1 Control Signal Buffer and Control Signal Register

Generally, the TTE 100 can include a control signal buffer 112 and a control signal register 110 configured to receive and store control signals input to the TTE 100 by the processor system 200. More specifically, the TTE 100 can store control signals in the control signal buffer 112, each control signal specifying details of a memory transfer operation to be executed by the TTE 100 (further described below), such as the source and destination addresses and a set of variables representing a source access pattern and a destination storage pattern, and can dequeue (in first-in-first-out order) these control signals to the control signal register 110 for execution by the TTE 100. Thus, the TTE 100 can access, from the control signal register 110, instructions to execute a strided, non-contiguous memory access operation.

The TTE 100 can receive control signals from a control processor for dynamically scheduled processes or from a queue of statically scheduled instructions for statically scheduled processes. Additionally, each control signal can include the starting source address for the source access pattern and the starting destination address for the destination storage pattern.
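
The first-in-first-out relationship between the control signal buffer and the register that holds the active control signal can be modeled as below. The class names `ControlSignal` and `ControlSignalBuffer`, the method names, and the particular set of fields are hypothetical and chosen only for illustration; an actual control signal may carry additional fields (e.g., a destination storage pattern).

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlSignal:
    """Illustrative subset of the fields a control signal may carry."""
    initial_source_address: int
    initial_destination_address: int
    source_block_count: int
    first_source_stride_length: int
    first_source_stride_count: int


class ControlSignalBuffer:
    """FIFO queue of pending control signals (software model only)."""

    def __init__(self):
        self._queue = deque()

    def enqueue(self, signal):
        self._queue.append(signal)

    def dequeue_to_register(self):
        # Return the oldest pending signal, or None when the buffer is empty.
        return self._queue.popleft() if self._queue else None


buffer = ControlSignalBuffer()
first = ControlSignal(0, 0, 2, 4, 3)
second = ControlSignal(16, 6, 2, 4, 3)
buffer.enqueue(first)
buffer.enqueue(second)
active = buffer.dequeue_to_register()  # the signal loaded for execution
```

In this model, `active` holds the earliest enqueued signal, and the next call to `dequeue_to_register` would yield the following one once the current transfer completes.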

5.2 Address Registers

Generally, the TTE 100 can include a source address register 120 configured to store a current address, in the source memory component 210, from which the TTE 100 accesses a data word and transfers this data word to the destination memory component 220. Likewise, the TTE 100 includes a destination address register 130 configured to store a current address, in the destination memory component 220, to which the TTE 100 can transfer a data word accessed from the source memory component 210. Thus, the TTE 100 can advance these addresses according to the specified source access patterns and destination storage patterns, thereby maintaining a current source memory address from which the TTE 100 can access a data word and a current destination memory address to which the TTE 100 can store a data word during a data transfer operation.

5.3 Contiguous Block Counters

Generally, the TTE 100 can include: a source block counter 122, configured to count (e.g., by successively decrementing the value of the source block counter 122) the number of contiguous data words remaining for a current contiguous data block in the set of contiguous data blocks specified in the source access pattern; and a destination block counter configured to count (e.g., by successively decrementing the value of the destination block counter) the number of contiguous words remaining for each contiguous data block in the set of contiguous data blocks specified in the destination storage pattern. Additionally, after accessing or storing the current contiguous data block, the TTE 100 can reset the value of the source block counter 122 or the destination block counter to match a source block count or destination block count indicated by the control signal, in preparation for access or storage of the next contiguous data block specified by either the source access pattern or the destination storage pattern respectively. Thus, the TTE 100 can repeatedly access or store contiguous data blocks of a consistent size according to the source access pattern or the destination storage pattern.

In one implementation, the TTE 100 can include a data bus configured to transfer a single data word. For example, if the processor system 200 including the TTE 100 operates with 32-bit data words, the TTE 100 can include a 32-bit data bus in order to transfer single data words between memory components in the processor system 200.

More specifically, the TTE 100 can access a source block count (i.e., a source block size) for a source access pattern by accessing the control signal register 110 storing a control signal including the source block count. Additionally, the TTE 100 can transfer the source block count to a source block counter 122 via the control logic 160, to enable the TTE 100 to decrement a current source block count in the source block counter 122. Likewise, the TTE 100 can access a destination block count (i.e., a destination block size) for a destination storage pattern by accessing the control signal register 110 storing a control signal including the destination block count.

The TTE 100 can count the number of data words within each source data block or destination block defined by the source access pattern or the destination storage pattern respectively by executing a while loop that continuously decrements the source block counter 122 or destination block counter.

More specifically, the TTE 100 can write the source block count to the source block counter 122; and transfer a target source data block stored at the current source address to the current destination address by, in response to a current source block count in the source block counter 122 representing at least one source data word remaining in the target source data block: transferring a source data word at the current source address to the current destination address in the destination address register 130; incrementing the source address register 120; incrementing the destination address register 130; and decrementing the current source block count in the source block counter 122. Subsequently, in response to completing transfer of the target source data block, the TTE 100 can reset the source block counter 122 to the source block count stored in the control signal register 110.

Alternatively, the TTE 100 can, instead of transferring each contiguous data block directly to the destination memory component 220, transfer each source data word in a source data block into a data buffer 170 and subsequently transfer contiguous destination blocks (characterized by a destination block count different from the source block count) from the data buffer 170 to destination addresses in the destination memory component 220. In this implementation, the TTE 100 can write a destination block count (included in the control signal stored in the control signal register 110) to a destination block counter; and transfer a target destination data block from the data buffer 170 to the destination memory component 220 by, in response to a current destination block count in the destination block counter representing at least one destination word remaining in the target destination block: transferring a destination word in the data buffer 170 to the current destination address in the destination address register 130; incrementing the destination address register 130; and decrementing the current destination block count in the destination block counter. Subsequently, in response to completing transfer of the target destination block, the TTE 100 can reset the destination block counter to the destination block count stored in the control signal register 110.
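
The buffered variant can be sketched as a drain loop that writes words from the data buffer in destination-sized blocks. The function name `drain_buffer` is hypothetical, the data buffer is modeled as a `collections.deque`, and the advancement of the destination address between blocks by a start-to-start destination stride length is an assumption added so that the re-blocking is visible in the result.

```python
from collections import deque


def drain_buffer(data_buffer, dst, dst_addr, dst_block_count, dst_stride_length):
    """Write buffered words to `dst` in blocks of dst_block_count words.

    Assumption: dst_stride_length is the word distance between the starts
    of consecutive destination blocks.
    """
    block_counter = dst_block_count
    while data_buffer:
        while block_counter > 0 and data_buffer:
            dst[dst_addr] = data_buffer.popleft()  # oldest buffered word first
            dst_addr += 1
            block_counter -= 1
        if block_counter == 0:
            # Destination block complete: stride to the next block and reset.
            dst_addr += dst_stride_length - dst_block_count
            block_counter = dst_block_count
    return dst


# Six words gathered as two source blocks of three are stored as three
# destination blocks of two.
buffered = deque([1, 2, 3, 4, 5, 6])
stored = drain_buffer(buffered, [0] * 12, dst_addr=0,
                      dst_block_count=2, dst_stride_length=4)
```

Decoupling the two block counts through the buffer is what lets the source and destination patterns use different block sizes within one transfer.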

5.4 Stride Counters

Generally, the TTE 100 includes a set of source stride counters 140 and/or a set of destination stride counters 150 in order to track the number of strides in each data transfer operation and in each dimension. For example, in implementations of the TTE 100 supporting data transfer of four-dimensional tensors, the TTE 100 can include up to three source stride counters 140 and up to three destination stride counters 150 in order to execute the source access pattern and the destination storage pattern respectively. Thus, upon completing access or storage of a contiguous data block (according to a value of a corresponding block counter), the TTE 100 can stride in a first dimension to a non-contiguous source or destination address and decrement a first stride counter prior to resetting the source or destination block counter and accessing or storing a subsequent contiguous data block. The TTE 100 continues this process until a value of the first stride counter is equal to zero, in which case the TTE 100 can initiate a stride in a different dimension and decrement a second stride counter or, if the TTE 100 is completing only a one-dimensional strided transfer operation, then the TTE 100 can complete the transfer operation and dequeue subsequent control signals from the control signal buffer 112.

In one implementation, the TTE 100 includes three source stride counters 140 and three destination stride counters 150 and can access strided, non-contiguous data from a four-dimensional input tensor in the source memory component 210 and reformat these data in four dimensions to store an output tensor in the destination memory component 220. In another implementation, the TTE 100 includes three source stride counters 140, but no destination stride counters 150 and, as such, can only store data in the destination memory component 220 in a linear or contiguous format but can access data according to a four-dimensional strided access pattern.

Generally, upon completing a set of strides along one dimension of a source access pattern or a destination storage pattern, the TTE 100 can advance relevant memory addresses (e.g., either the current source address or the current destination address) based on the dimension of the stride relative to the multidimensional array representing the surface at the source memory component 210 and the multidimensional array being generated in the destination memory component 220. For example, the TTE 100 can advance the current source address in the source address register 120 by a factor associated with the dimension of the stride (e.g., representing a number of memory addresses that represent a row in the source surface).
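
One way to read the dimension-dependent factor described above is as a per-dimension address pitch. The sketch below assumes a row-major layout with one data word per element and a stride length expressed in elements along the stride dimension; the helper names `dimension_pitches` and `advance` are hypothetical.

```python
def dimension_pitches(shape):
    """Words spanned by one step along each dimension of a row-major array."""
    pitches = []
    pitch = 1
    for extent in reversed(shape):
        pitches.append(pitch)
        pitch *= extent
    return list(reversed(pitches))


def advance(address, stride_length, dimension, pitches):
    """Advance an address by stride_length steps along one dimension.

    Assumption: stride_length counts elements along `dimension`, so the
    address moves by stride_length times that dimension's pitch.
    """
    return address + stride_length * pitches[dimension]


pitches = dimension_pitches((224, 224, 3))  # (rows, cols, channels)
next_address = advance(0, stride_length=2, dimension=0, pitches=pitches)
```

With this layout, a stride of two along the row dimension moves the address by two full rows (2 x 672 words), which corresponds to the "number of memory addresses that represent a row" in the example above.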

In implementations of the TTE 100 including a destination stride counter, the TTE 100 includes a control signal register 110 configured to store a control signal: representing a source access pattern in the source memory component 210 defining a first dimension and including the set of source data blocks; representing a destination storage pattern in the destination memory component 220 defining a second dimension and comprising a set of destination blocks; and including the initial source address, the initial destination address, the first source stride length in the first dimension, the first source stride count in the first dimension, a first destination stride length in the second dimension, and a first destination stride count in the second dimension. In this implementation, the TTE 100 can include control logic 160 configured to execute the strided data transfer operation by: writing an initial source address to the source address register 120; writing a first source stride count to the first source stride counter; writing an initial destination address to the destination address register 130; and writing a first destination stride count to the first destination stride counter.
Additionally, the control logic 160 can continue executing the strided data transfer operation by, in response to a first current source stride count in the first source stride counter representing at least one remaining source data block in the first dimension of the source access pattern: reading the current source address from the source address register 120; reading the current destination address from the destination address register 130; transferring the source data block stored at the current source address to the current destination address; advancing the source address register 120 based on the first source stride length, the first dimension, and the current source address; advancing the destination address register 130 based on the first destination stride length and the current destination address; decrementing the first current source stride count in the first source stride counter; and decrementing a first current destination stride count in the first destination stride counter.
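
The paired source and destination stride counters can be modeled as below. This is a sketch under stated assumptions: the function name `transfer_with_destination_stride` is hypothetical, both stride lengths are word offsets between block starts, each block is moved as a whole rather than word by word, all addresses are assumed to stay within the bounds of the lists, and the loop is assumed to stop when either counter reaches zero.

```python
def transfer_with_destination_stride(src, dst, src_addr, dst_addr, block_count,
                                     src_stride_length, src_stride_count,
                                     dst_stride_length, dst_stride_count):
    """Gather strided source blocks and scatter them to strided destinations.

    Assumptions: stride lengths are start-to-start word offsets, and every
    block lies fully inside `src` and `dst`.
    """
    src_counter = src_stride_count
    dst_counter = dst_stride_count
    while src_counter > 0 and dst_counter > 0:
        # Transfer one contiguous block of block_count words.
        dst[dst_addr:dst_addr + block_count] = src[src_addr:src_addr + block_count]
        src_addr += src_stride_length   # advance the source address register
        dst_addr += dst_stride_length   # advance the destination address register
        src_counter -= 1
        dst_counter -= 1
    return dst


result = transfer_with_destination_stride(
    list(range(12)), [0] * 9, src_addr=0, dst_addr=0, block_count=2,
    src_stride_length=4, src_stride_count=3,
    dst_stride_length=3, dst_stride_count=3)
```

In this example, two-word blocks spaced four words apart in the source are re-spaced three words apart in the destination, i.e., the data are reformatted during the transfer rather than in a separate pass.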

In yet another implementation, the TTE 100 can include a set of stride counters representing strides in a first dimension and in a second dimension (e.g., representing a two-dimensional strided source access pattern). In this implementation, the TTE 100 can first iterate through a set of strided data blocks in a first dimension; and, upon completion of this set of strided data blocks, reset a first stride counter, before striding a second dimension. More specifically, the TTE 100 can include a control register configured to store a control signal: representing the source access pattern in the source memory component 210 defining a first dimension, defining a second dimension, and including the set of source data blocks; and including the initial source address, the initial destination address, the first source stride length in the first dimension, the first source stride count in the first dimension, a second source stride length in the second dimension, and a second source stride count in the second dimension. In this implementation, the TTE 100 also includes a second source stride counter communicatively coupled to the control signal register 110 and configured to store a second current source stride count in the second dimension. Additionally, in this implementation, the TTE 100 includes control logic 160 configured to execute the strided data transfer operation by: writing the initial source address to the source address register 120; writing the first source stride count to the first source stride counter; writing the second source stride count to the second source stride counter; and writing the initial destination address to the destination address register 130. The control logic 160 is further configured to execute the strided data transfer operation by, in response to the first current source stride count in the first source stride counter representing at least one remaining source data block in the first dimension of the source access pattern and in response to a second current source stride count in the second source stride counter representing at least one remaining source data block in the second dimension of the source access pattern: reading the current source address from the source address register 120; reading the current destination address from the destination address register 130; transferring the source data block stored at the current source address to the current destination address; advancing the source address register 120 based on the second source stride length, the second dimension, and the current source address; advancing the destination address register 130; and decrementing the second current source stride count in the second source stride counter.

In this implementation, the TTE 100 continues decrementing the current source stride count in the second source stride counter until the stride counter indicates there are no additional strides remaining in the second dimension of the source access pattern. More specifically, the control logic 160 continues executing the strided data transfer operation by, in response to the first current source stride count in the first source stride counter representing at least one remaining source data block in the first dimension of the source access pattern and in response to the second current source stride count in the second source stride counter representing no remaining source data blocks in the second dimension of the source access pattern: resetting the second source stride counter to the second source stride count; advancing the source address register 120 based on the first source stride length, the first dimension, and the current source address; and decrementing the first current source stride count in the first source stride counter.
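The two-counter behavior described above can be modeled in software as a pair of nested while loops. The following is a minimal sketch rather than the hardware implementation: the function name `block_addresses_2d` is hypothetical, stride lengths are assumed to be expressed directly in address units, and each first-dimension stride is assumed to be measured from the start of the preceding sweep through the second dimension.

```python
def block_addresses_2d(initial_address, first_stride, first_count,
                       second_stride, second_count):
    """Return the start address of each block visited by a 2D strided sweep."""
    address = initial_address          # models the source address register
    first_counter = first_count        # models the first source stride counter
    second_counter = second_count      # models the second source stride counter
    visited = []
    while first_counter > 0:
        sweep_start = address          # assumption: outer stride is relative to this
        while second_counter > 0:
            visited.append(address)    # a block transfer would happen here
            address += second_stride   # advance along the second dimension
            second_counter -= 1
        second_counter = second_count  # reset the second stride counter
        address = sweep_start + first_stride  # stride along the first dimension
        first_counter -= 1
    return visited


example = block_addresses_2d(0, first_stride=32, first_count=2,
                             second_stride=4, second_count=3)
```

In this example, the sweep visits three blocks spaced four addresses apart, then repeats the same sweep 32 addresses later.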

In yet another implementation, the TTE 100 can include a third dimension and execute a third while loop implemented in hardware in order to complete a set of strides in the third dimension, prior to striding in the second dimension and resetting the third stride counter for the third dimension. Upon completing the strides in the second dimension, the TTE 100 can reset the second stride counter for the second dimension and stride in the first dimension. In this manner, the TTE 100 can transfer data blocks via a three-dimensional strided source access pattern.

In yet another implementation, the TTE 100 can include a fourth dimension and execute a fourth while loop implemented in hardware in order to complete a set of strides in a fourth dimension. Thus, the TTE 100 can support any number of strided dimensions for the source access pattern or the destination storage pattern for the strided data transfer operation.
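The extension to an arbitrary number of strided dimensions can be illustrated with a closed-form address calculation. This sketch is an illustration under stated assumptions, not the claimed hardware: the name `strided_addresses_nd` is hypothetical, stride lengths are assumed to be in address units, and the highest-numbered dimension is assumed to be the innermost loop, consistent with the nesting order described above.

```python
from itertools import product


def strided_addresses_nd(initial_address, strides):
    """Return block start addresses for an N-dimensional strided pattern.

    strides: list of (stride_length, stride_count) pairs, outermost
    dimension first. The last pair varies fastest, like an innermost loop.
    """
    lengths = [length for length, _ in strides]
    index_ranges = [range(count) for _, count in strides]
    return [
        initial_address + sum(i * length for i, length in zip(index, lengths))
        for index in product(*index_ranges)
    ]


three_dimensional = strided_addresses_nd(100, [(1024, 2), (32, 2), (2, 2)])
```

Each additional `(stride_length, stride_count)` pair plays the role of one more hardware stride counter and its associated while loop.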

5.5 Data Buffer

Generally, the TTE 100 can include a data buffer 170 configured to store source data blocks from the source memory component 210 prior to transfer to the destination memory component 220. Thus, the TTE 100 can: transfer a source data block into the data buffer 170; store this source data block within the data buffer 170; and, in response to receiving bus access from the processor system 200, asynchronously transfer the source data block to the destination memory component 220.

More specifically, the data buffer 170 is communicatively coupled to the read and write ports of the control logic 160, enabling the data buffer 170 to receive and disperse data blocks over the communication buses of the processor system 200. The TTE 100, via the data buffer 170, can therefore transfer the target source data block stored at the current source address to the current destination address by: at a first time, loading the target source data block from the current source address into the data buffer 170; and, at a second time, transferring the target source data block from the data buffer 170 to the current destination address. Consequently, the TTE 100 can avoid occupying the system bus of the processor system 200 for an extended number of consecutive cycles and also maintain high utilization of both the source memory component 210 and the destination memory component 220 during the strided data transfer operation.

In particular, the TTE 100 can transfer a source data block stored at a current source address in the source address register 120 to the data buffer 170 based on a current source block count in the source block counter 122 by, in response to a current source block count in the source block counter 122 representing at least one source data word remaining in the source data block: enqueuing a source data word stored at a current source address in the source address register 120 to the data buffer 170; advancing the current source address in the source address register 120; and decrementing the current source block count in the source block counter 122. Concurrently and/or asynchronously, the TTE 100 can remove data blocks from the data buffer 170 by, in response to a current destination block count in the destination block counter representing at least one destination word remaining in the destination block: dequeuing a source data word stored in the data buffer 170 to transfer the source data word to the current destination address in the destination memory component 220; incrementing the current destination address in the destination address register 130; and decrementing the destination block count in the destination block counter. Thus, the TTE 100 can execute two separate, and optionally simultaneous, while loops to asynchronously transfer data blocks to and from the data buffer 170, thereby transferring these complete data blocks from the source memory component 210 to the destination memory component 220.
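The enqueue and dequeue loops described above can be modeled with a first-in, first-out queue. The following is a sketch under illustrative assumptions: the function name `transfer_block`, the `buffer_depth` parameter, and the rule that the destination side obtains bus access only on odd-numbered cycles are all hypothetical and chosen only to show the two loops progressing at different rates.

```python
from collections import deque


def transfer_block(source, src_addr, dest, dst_addr, block_count, buffer_depth=4):
    """Move one block of words from source to dest through a FIFO buffer."""
    fifo = deque()               # models the data buffer
    src_counter = block_count    # models the source block counter
    dst_counter = block_count    # models the destination block counter
    cycle = 0
    while dst_counter > 0:
        # Source-side loop body: enqueue one word if any remain and space exists.
        if src_counter > 0 and len(fifo) < buffer_depth:
            fifo.append(source[src_addr])
            src_addr += 1
            src_counter -= 1
        # Destination-side loop body: assumed to have bus access on odd cycles.
        if cycle % 2 == 1 and fifo:
            dest[dst_addr] = fifo.popleft()
            dst_addr += 1
            dst_counter -= 1
        cycle += 1
    return src_addr, dst_addr


source_memory = list(range(10, 20))
destination_memory = [0] * 8
final_addresses = transfer_block(source_memory, 2, destination_memory, 1, 6)
```

Because the source side may run ahead of the destination side by up to `buffer_depth` words, neither side has to hold the bus while waiting for the other.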

5.6 Transpose Buffer

In implementations in which the TTE 100 is configured to transpose accessed data during a transfer operation, as shown in FIG. 8, the TTE 100 can include a transpose buffer 172 configured to efficiently transpose data stored in the transfer buffer after access from the source memory component 210 and prior to storage in the destination memory component 220 (e.g., by improving transfer bus bandwidth between the source memory component 210 and the transfer buffer). More specifically, the transpose buffer 172 can include a square array of flip flops, latches, or single word RAM instances, and the TTE 100 is configured to store data in the transpose buffer 172 in one orientation and access data from the transpose buffer 172 in a second orientation transposed from the first orientation, thereby transposing the data input to the transpose buffer 172. Thus, while transposing data from the source memory component 210, the TTE 100 can transfer data from the source memory component 210 to the transpose buffer 172 using the full transfer bus (e.g., 32 bytes of a 32-byte transfer bus) as opposed to accessing individual bit-words (e.g., 1 byte of a 32-byte transfer bus) in a specific order in order to transpose the data into the destination memory location.

In one implementation, the TTE 100 can transfer data into the transpose buffer 172 instead of into the data buffer 170, thereby enabling the transpose buffer 172 to serve multiple functions (e.g., as both a buffer enabling asynchronous data transfer and a means for transposing data during the data transfer process). More specifically, the system can transfer a target source data block stored at a source address in the source memory component 210 to a destination address in the destination memory component 220 by: loading the target source data block from the current source address into a transpose buffer 172 according to a first buffer dimension of the transpose buffer 172; and transferring the target source data block from the transpose buffer 172 according to a second buffer dimension of the transpose buffer 172.
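Loading along one buffer dimension and reading along the other can be illustrated with a small square array. This is a software sketch only: the function name `transpose_through_buffer` is hypothetical, and each inner list is assumed to stand for one full-width bus transfer.

```python
def transpose_through_buffer(bus_transfers):
    """Write rows into a square buffer, then read the buffer out by columns."""
    size = len(bus_transfers)
    buffer = [[None] * size for _ in range(size)]  # square array of storage cells
    # Load along the first buffer dimension: one full-width write per row.
    for row, words in enumerate(bus_transfers):
        if len(words) != size:
            raise ValueError("each transfer must fill one full buffer row")
        buffer[row] = list(words)
    # Read along the second buffer dimension: one column per output transfer.
    return [[buffer[row][col] for row in range(size)] for col in range(size)]


transposed = transpose_through_buffer([[1, 2, 3],
                                       [4, 5, 6],
                                       [7, 8, 9]])
```

Every write and every read in this model moves a full row or column, which mirrors the stated benefit of using the full transfer bus rather than one word at a time.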

In another implementation, the TTE 100 includes a transpose buffer 172 similarly communicatively coupled to the read and write ports of the control logic 160, thereby enabling data blocks to be directly transferred to and from the transpose buffer 172.

In these implementations of the TTE 100, the TTE 100 can support transposes between any two dimensions of a multidimensional tensor temporarily stored in the transpose buffer 172 during the strided data transfer operation. In these implementations, the TTE 100 can store a control signal specifying the particular dimensions to transpose within the multidimensional tensor. In one example, for a multidimensional tensor defining an image height dimension, an image width dimension, a color dimension, and a batch dimension, the TTE 100 can access a field in the control signal stored in the control signal register 110 indicating a transpose between the image height dimension and the image width dimension. Alternatively, the TTE 100 can execute a transpose of the color and batch dimensions. Thus, the transpose buffer 172 is configured to transpose between any two dimensions of a multidimensional tensor.

6. Control Logic

Generally, the TTE 100 includes control logic 160 configured to execute the method S100. More specifically, the control logic 160 includes a set of logic gates, registers, and communication ports configured as a finite state machine to execute the methods S100 and S200. Thus, the control logic 160 interfaces with each of the registers and counters in the TTE 100 and interfaces with control processors, processing units, and memory components. In one implementation, the control logic 160 can include a set of ports such as DMA request, DMA acknowledge, read, write, and interrupt ports. Thus, the control logic 160 is configured to execute the strided data transfer operation by: transferring values from the control signal register 110 to other counters and registers in the TTE 100 prior to initiating a transfer cycle; reading and writing data blocks to and from the data buffer 170 and/or transpose buffer 172; calculating and coordinating the advancement of source addresses and destination addresses according to stride lengths, associated dimensions, and the indicated topology of the source access pattern and destination storage pattern (as defined by the control signal); resetting stride counters and block counters in order to track the number of strides and/or the number of contiguous blocks that have been transferred in a single transfer cycle; and, upon detecting completion of a data transfer operation, writing a subsequent control signal to the control signal register 110. Therefore, by combining these operations according to the contents of the control signal, the control logic 160 is configured to execute Blocks of the methods S100, S200, and S300.

7. Operation

Generally, the above-described TTE 100 executes Blocks of the methods S100, S200, and S300 in order to access strided, non-contiguous data blocks of a source surface (e.g., an array, matrix, or tensor stored at a source memory component 210) and stores these data blocks at a destination memory component 220 via execution of multiple transfer cycles. During each transfer cycle, the TTE 100 transfers a series of contiguous blocks along a single dimension of the strided source access pattern. Thus, in order to transfer data blocks according to a multidimensional strided access pattern or strided destination storage pattern, the TTE 100 can execute multiple nested transfer cycles.

In particular, in order to transfer a source data word stored at a current source address to a current destination address, the TTE 100 can: at a first time, load the source data word from the current source address into the data buffer 170; and, at a second time, transfer the source data word from the data buffer 170 to the current destination address.

More specifically, the TTE 100 can: receive and/or access a control signal; write addresses and values from the control signal to the source address register 120, the destination address register 130, the source block counter 122, the destination block counter, the set of source stride counters 140, and/or the set of destination stride counters 150; execute a series of nested while loops (e.g., transfer cycles) to access non-contiguous data blocks across the source surface according to the source access pattern; and/or execute a series of nested while loops to store these non-contiguous data blocks on a destination surface according to the destination storage pattern. Thus, the TTE 100 can, with a single control signal, complete a complex series of data block transfers that, when executed on a standard TTE, require a number of control signals equal to the number of data blocks in the source access pattern.

7.1 Control Signal Access

Generally, the TTE 100 can access a control signal and interpret instructions for a strided transfer based on the control signal. More specifically, the TTE 100 can access a control signal in order to initiate a strided transfer by writing a control signal from the control signal buffer 112 to the control register. Alternatively, the TTE 100 can receive the control signal directly from a control processor included within the processor system 200. Thus, by continually receiving control signals in the control signal buffer 112 and sequentially writing these control signals to the control register, the TTE 100 can complete a series of strided transfer operations in accordance with a scheduled task for the processor system 200.

Each control signal defines an initial source address (e.g., corresponding to the lowest address value within the source surface), an initial destination address (e.g., corresponding to the lowest address value within the destination surface), a source block count, and a set of variables defining the source access pattern and/or the destination storage pattern such as those shown in FIGS. 7A, 7B, 7C, and 7D. In one implementation, the set of variables defining the source access pattern includes a source stride count, a source stride length, and a source stride dimension (e.g., for implementations of the TTE 100 supporting multi-dimensional strides) for each stride dimension in the source access pattern. For example, for a two-strided access pattern, the control signal defines a first source stride count, a first source stride length, a first source stride dimension, a second source stride count, a second source stride length, and a second source stride dimension. Likewise, the control signal can similarly define a destination storage pattern by including a destination stride count, a destination stride length, and a destination stride dimension for each stride dimension in the destination storage pattern.
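The fields enumerated above can be pictured as a simple record. The following sketch is purely illustrative: the class names `StrideSpec` and `ControlSignal` and all field names are hypothetical, and no particular bit layout or field width is implied.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StrideSpec:
    """One strided dimension of an access or storage pattern."""
    count: int      # number of strides to execute in this dimension
    length: int     # stride length in this dimension
    dimension: int  # which surface dimension the stride moves along


@dataclass
class ControlSignal:
    """Hypothetical software view of the values carried by a control signal."""
    initial_source_address: int
    initial_destination_address: int
    source_block_count: int
    source_strides: List[StrideSpec] = field(default_factory=list)
    destination_strides: List[StrideSpec] = field(default_factory=list)


# A two-strided source access pattern with no destination strides.
signal = ControlSignal(
    initial_source_address=0x1000,
    initial_destination_address=0x8000,
    source_block_count=4,
    source_strides=[StrideSpec(count=3, length=2, dimension=1),
                    StrideSpec(count=2, length=1, dimension=2)],
)
```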

In one implementation, the control signal can also include a definition of the source surface and/or the destination surface by describing the representation of the source surface or destination surface in terms of the dimensions of these surfaces. For example, the control signal can indicate that the source surface spans 32 data words in a first dimension, 32 data words in a second dimension, 32 data words in a third dimension, and three data words in a fourth dimension. Therefore, the TTE 100 can calculate the number of addresses to advance when executing a stride in each of the dimensions. For example, given the example source surface above, the TTE 100, when executing a stride of length one in the second dimension, advances the value of the source address register 120 by 32 data words minus the source block count. Likewise, given the example source surface, the TTE 100, when executing a stride of length two in the third dimension, advances the source address register 120 by 32×32×2=2048 data words minus the source block count.
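The arithmetic in the example above can be reproduced with a short calculation. This is a sketch under stated assumptions: the function name `address_step` is hypothetical, dimensions are indexed from zero with the first dimension at index 0, and the source block count of 4 is an arbitrary value chosen for illustration.

```python
from math import prod


def address_step(surface_dims, stride_dim, stride_length, block_count):
    """Words to advance the address register after one block has been read.

    surface_dims: extent of each surface dimension, first dimension first.
    stride_dim: zero-based index of the dimension being strided.
    """
    # Dimensional factor: words spanned by one unit step in stride_dim.
    dimensional_factor = prod(surface_dims[:stride_dim])
    # The block read has already advanced the register by block_count words.
    return stride_length * dimensional_factor - block_count


surface = [32, 32, 32, 3]
step_second_dim = address_step(surface, stride_dim=1, stride_length=1, block_count=4)
step_third_dim = address_step(surface, stride_dim=2, stride_length=2, block_count=4)
```

With a block count of 4, the two steps evaluate to 32 − 4 = 28 words and 2048 − 4 = 2044 words, respectively.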

In another implementation, the TTE 100 can access control signals that indicate the source memory component 210 and the destination memory component 220 for a strided transfer operation in implementations in which the TTE 100 is connected to multiple source memory components 210 and/or multiple destination memory components 220. The TTE 100 can also access control signals that indicate broadcast functionality and cause the TTE 100 to transfer non-contiguous data blocks to multiple destination memory components 220.

In yet another implementation, the TTE 100 can access control signals indicating differences between the bit length of the source surface and a desired bit length of the destination surface. Thus, the TTE 100 can change the bit length (e.g., via bit expansion or bit compression) of each data word during transfer of the data word from the source memory component 210 to the destination memory component 220.

7.2 Strided Transfer

Generally, to initiate a strided transfer operation, the TTE 100 initializes counters and registers in preparation for executing a series of nested while loops based on the values of these registers. In an initialization step, the TTE 100: writes the initial source address to the source address register 120; writes the initial destination address to the destination address register 130; writes the source block count to the source block counter 122 (and/or the destination block count to the destination block counter); and, for each strided dimension in the source access pattern, writes the source stride count to the source stride counter.

Once the TTE 100 populates the registers and counters with the corresponding values from the control signal, the TTE 100 can access a first contiguous data block in the source access pattern and transfer this data block to the data buffer 170 of the TTE 100. To accomplish this, the TTE 100: reads a current source address from the source address register 120 (e.g., the initial source address for the first data word); accesses the data word stored at the current source address in the source memory component 210; transiently stores the data word in the data buffer 170; decrements the source block counter 122; and advances the current source address in the source address register 120 to the subsequent address. The TTE 100 can repeat this process until the value of the source block counter 122 is equal to zero or otherwise represents that a number of source data words have been transferred equal to the source block count, thereby indicating that a single data block has been accessed by the TTE 100. In response to the value of the source block counter 122 equaling zero, the TTE 100: resets the source block counter 122 to the source block count; advances a current source address in the source address register 120 by the first stride length minus the source block count; decrements the first source stride counter; and initializes a second iteration of the above-described block counter loop in order to access a second contiguous data block in the source access pattern. The TTE 100 can continue this process of accessing a contiguous data block and advancing the current source address based on the first source stride length until the value of the first source stride counter is equal to zero or otherwise indicates that all of the strides in this first dimension are complete.
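The block counter loop and the first stride counter loop described above can be sketched as follows. This is a minimal software model, not the hardware design: the function name `read_strided_blocks` is hypothetical, the source memory is modeled as a Python list of words, and the stride length is assumed to be expressed in words.

```python
def read_strided_blocks(memory, initial_address, block_count,
                        stride_length, stride_count):
    """Collect stride_count blocks of block_count contiguous words each."""
    data_buffer = []
    address = initial_address        # models the source address register
    stride_counter = stride_count    # models the first source stride counter
    while stride_counter > 0:
        block_counter = block_count  # (re)load the source block counter
        while block_counter > 0:
            data_buffer.append(memory[address])  # store the word in the buffer
            block_counter -= 1
            address += 1             # advance to the subsequent address
        # The block is complete: skip ahead to the start of the next block.
        address += stride_length - block_count
        stride_counter -= 1
    return data_buffer


words = read_strided_blocks(list(range(32)), initial_address=0,
                            block_count=2, stride_length=8, stride_count=3)
```

Because the inner loop has already advanced the address by `block_count`, adding `stride_length - block_count` places consecutive block starts exactly `stride_length` words apart.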

In one implementation, instead of populating the source block counter 122 and/or the stride counters with values from the control signal register 110 during initialization, the TTE 100 can increment a count in the source block counter 122, the destination block counter, the set of source stride counters 140, and/or the set of destination stride counters 150 and detect when this count equals the source block count, the destination block count, the source stride count, or the destination stride count, respectively. Thus, in this implementation, the control logic 160 executes comparisons with the control register instead of detecting a minimum value (e.g., zero) of the count in order to identify completion of a transfer cycle.

In implementations or operations in which the TTE 100 is only executing a stride in a single dimension, the TTE 100 ceases accessing the source memory component 210 upon completion of the transfer cycle. However, in implementations or operations in which the TTE 100 is executing strides in multiple dimensions, the aforementioned loops (based on the first source stride counter and the source block counter 122 respectively) are nested within additional source stride counter loops. More specifically, in response to the value of the first stride counter being equal to zero (or otherwise indicating that no strides remain in the first dimension, as described above), the TTE 100: resets the value of the first stride counter to the first stride count; advances the current source address in the source address register 120 according to the stride length in the second stride dimension (i.e., the first dimension of the source surface multiplied by the stride length minus the source block count); and decrements the value of a second source stride counter. Thus, the TTE 100 can execute a stride-counter-based loop for each stride dimension in the source access pattern.

More specifically, in order to advance the source address register 120 or the destination address register 130 (upon completion of a nested transfer cycle), the TTE 100 can advance the source (or destination) address register based on the first source (or destination) stride length, the dimension associated with that stride, and the current address stored within the relevant register by: calculating a source (or destination) address step size by multiplying the first source (or destination) stride length by a dimensional factor for the relevant dimension and subtracting a source (or destination) block count; and advancing the current source (or destination) address in the source (or destination) address register by the source (or destination) address step size.

In one example in which the dimension represents a height of an input surface stored in the source memory component 210, the TTE 100 can utilize a dimensional factor for the dimension equal to the length of each row in the input surface. Therefore, if the stride length in the dimension is equal to three, the address step size is equal to three times the row length of the input surface minus the contiguous block count.

For an application including a three-dimensional strided source access pattern, the TTE 100 can execute the following steps in order to transfer the set of source data blocks represented by the source access pattern to the destination memory component 220. More specifically, the TTE 100 can write, to the control signal register 110, a control signal representing a source access pattern in the source memory component 210 defining a first dimension, a second dimension, and a third dimension and including a set of source data blocks. Additionally, the control signal includes: an initial source address; an initial destination address; a first source stride length in the first dimension; a first source stride count in the first dimension; a second source stride length in the second dimension; a second source stride count in the second dimension; a third source stride length in the third dimension; and a third source stride count in the third dimension. The TTE 100 can initialize the source stride counters 140 by: writing the first source stride count to the first source stride counter; writing the second source stride count to the second source stride counter; and writing the third source stride count to a third source stride counter. The TTE 100 can then execute a nested transfer cycle of the strided data transfer operation by, in response to the first current source stride count in the first source stride counter representing at least one remaining source data block in the first dimension of the source access pattern, in response to the second current source stride count in the second source stride counter representing at least one remaining source data block in the second dimension of the source access pattern, and in response to a third current source stride count in the third source stride counter representing at least one remaining source data block in the third dimension of the source access pattern: reading the current source address from the source address register 120; reading the current destination address from the destination address register 130; transferring the target source data block stored at the current source address to the current destination address (e.g., via the data buffer 170); advancing the source address register 120 based on the third source stride length, the third dimension, and the current source address; advancing the destination address register 130; and decrementing the third current source stride count in the third source stride counter. The TTE 100 can then detect completion of the transfer cycle in response to the first current source stride count in the first source stride counter representing at least one remaining source data block in the first dimension of the source access pattern, in response to the second current source stride count in the second source stride counter representing at least one remaining source data block in the second dimension of the source access pattern, and in response to the third current source stride count in the third source stride counter representing no additional source data blocks in the third dimension of the source access pattern. The TTE 100 can then reset the third source stride counter to the third source stride count.

Upon resetting the third source stride counter to the third source stride count, the TTE 100 can, in response to the first current source stride count in the first source stride counter representing at least one remaining source data block in the first dimension of the source access pattern and in response to a second current source stride count in the second source stride counter representing at least one remaining source data block in the second dimension of the source access pattern: advance the source address register 120 based on the second source stride length, the second dimension, and the current source address; advance the current destination address in the destination address register 130; read the current source address from the source address register 120; read the current destination address from the destination address register 130; transfer the target source data block stored at the current source address to the current destination address; and decrement the second current source stride count in the second source stride counter. Thus, between completing transfer cycles in the third dimension of the source access pattern, the TTE 100 can execute a stride along the second dimension of the source access pattern.

After completing many nested transfer cycles along the third dimension of the source access pattern and executing a stride in the second dimension for each of those transfer cycles, the TTE 100 completes a transfer cycle along the second dimension. Thus, the TTE 100 can, in response to the first current source stride count in the first source stride counter representing at least one remaining source data block in the first dimension of the source access pattern and in response to the second current source stride count in the second source stride counter representing no remaining source data blocks in the second dimension of the source access pattern: reset the second source stride counter to the second source stride count; advance the source address register 120 based on the first source stride length, the first dimension, and the current source address; advance the current destination address in the destination address register 130; read the current source address from the source address register 120; read the current destination address from the destination address register 130; transfer the target source data block stored at the current source address to the current destination address; and decrement the first current source stride count in the first source stride counter.

Upon completion of the highest-level transfer cycle (e.g., the transfer cycle for the first dimension or the dimension which is not nested within another transfer cycle), the TTE 100 completes the strided data transfer operation and writes a subsequent control signal to the control signal register 110.

In one implementation capable of executing three-dimensional strides, the TTE 100 can write, to the control signal register 110, a control signal representing a source access pattern: defining the first dimension representing an input height of an input surface; defining the second dimension representing an input width of the input surface; and defining the third dimension representing an input depth of the input surface. In this implementation, the input surface can be represented in the source memory component 210 (and in the destination memory component 220 upon completion of the transfer operation) as a multidimensional array or array of arrays.

As the TTE 100 executes the above-described loops to access data blocks from the source surface and enqueue these data blocks in the data buffer 170, the TTE 100 can concurrently and asynchronously dequeue these data blocks from the data buffer 170 to a current destination address in the destination address register 130. The TTE 100 can then execute the same form of nested loops operating based on the destination address register 130, the destination block counter, and the set of destination stride counters 150 in order to dequeue blocks from the data buffer 170 and store these blocks on the destination surface in the destination memory component 220 according to the destination access pattern.

In one implementation, in addition to executing a strided data transfer operation characterized by a multidimensional strided access pattern, the TTE 100 can also execute strided access patterns including negative stride lengths. In this implementation, when advancing source or destination addresses based on a negative stride length, the TTE 100 can decrease the value of the address in the address register for each stride count. Furthermore, the TTE 100 can include a control signal register 110 configured to store signed binary integers to enable the control logic 160 to identify negative stride lengths in the control signal.
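A negative stride length can be illustrated by decoding a signed field and walking addresses downward. This sketch rests on assumptions not stated in the text above: the two's-complement encoding, the 16-bit field width, and the function names `decode_signed` and `walk_strides` are all hypothetical.

```python
def decode_signed(field_value, width):
    """Interpret an unsigned register field as a two's-complement integer."""
    sign_bit = 1 << (width - 1)
    return (field_value & (sign_bit - 1)) - (field_value & sign_bit)


def walk_strides(initial_address, stride_length, stride_count):
    """Return the address visited at each stride; stride_length may be negative."""
    address = initial_address
    visited = []
    for _ in range(stride_count):
        visited.append(address)
        address += stride_length  # a negative length decreases the address
    return visited


stride = decode_signed(0xFFF8, width=16)  # assumed 16-bit stride length field
descending = walk_strides(24, stride, 4)
```

In this example, the field value 0xFFF8 decodes to a stride length of −8, so the walk starts at address 24 and moves toward address 0.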

Thus, the TTE 100 can enqueue successive data words to the data buffer 170 via a first series of nested transfer cycles or loops operating on the set of source registers and counters according to the source access pattern and dequeue successive data words from the data buffer 170 via a second series of nested transfer cycles operating on the set of destination registers and counters according to the destination access pattern. More specifically, the TTE 100 can write, to the control signal register 110, a control signal: representing a source access pattern in the source memory component 210 defining the first dimension and including the set of source data blocks in the source memory component 210; and representing a destination storage pattern in the destination memory component 220 defining a second dimension and including a set of destination blocks. In this implementation, the control signal includes the initial source address, the initial destination address, the first source stride length in a first dimension, the first source stride count in the first dimension, a first destination stride length in a second dimension, and a first destination stride count in the second dimension. Additionally, the TTE 100 can initialize the strided data transfer operation by writing the first destination stride count to a first destination stride counter. Subsequently, during a transfer cycle to the destination memory component 220, the TTE 100 can, in response to the first current source stride count in the first source stride counter representing at least one remaining source data block in the first dimension of the source access pattern and in response to completing transfer of the target source data block, advance the destination address register 130 based on the first destination stride length and the current destination address and decrement a current destination stride count in the first destination stride counter.

In another implementation, the TTE 100 can execute a separate set of nested transfer cycles in order to execute a multidimensional strided transfer from the data buffer 170 to the destination memory component 220, such that the source data blocks are rearranged into a distinctly patterned strided destination storage pattern upon transfer to the destination memory component 220.

In yet another implementation, the TTE 100 can generate and/or introduce a predetermined (e.g., by the control signal) constant pattern of values for inclusion in the destination surface (i.e., output surface). In this implementation, the TTE 100 can selectively fill regions of the data buffer 170 with the predetermined constant value or with a predetermined constant pattern. Thus, the TTE 100 can transfer these constant or pattern values from the data buffer 170 to the destination storage during the set of destination transfer cycles.

7.2.1 Dimension Mapping

In one implementation, the TTE 100 can execute a dimensional transformation between a source surface and a destination surface in order to rotate the representation of the source surface upon storage in the destination surface. In this implementation, the TTE 100 can modify the order of the nested loops and instead advance the current destination address over a second dimension before advancing in a first dimension, thereby transforming the first dimension of the source surface to the second dimension of the destination surface. In this manner, the TTE 100 can modify the dimensional mapping of the surface during transfer between memory components in the processor system 200.
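One way to picture this reordering is the sketch below, which assumes a two-dimensional, row-major surface and advances the destination address over the second destination dimension first. The function name and the row-major layout are assumptions made for illustration.

```python
def transfer_with_dimension_swap(src, rows, cols):
    """Copy a row-major rows x cols surface so source rows become destination columns."""
    dst = [None] * (rows * cols)
    src_addr = 0
    for r in range(rows):          # first source dimension
        dst_addr = r               # start at column r of the destination surface
        for c in range(cols):      # second source dimension
            dst[dst_addr] = src[src_addr]
            src_addr += 1          # source advances contiguously
            dst_addr += rows       # destination steps one full destination row
    return dst


# Source surface [[0, 1, 2], [3, 4, 5]] is stored as [[0, 3], [1, 4], [2, 5]].
print(transfer_with_dimension_swap([0, 1, 2, 3, 4, 5], 2, 3))  # [0, 3, 1, 4, 2, 5]
```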

Alternatively, the TTE 100 can map dimensions from the input surface to the destination surface by executing a set of transpose operations and maintaining linear destination address incrementation. For example, the TTE 100 can receive a control signal specifying a particular source access pattern (indicating strides in various dimensions) and also specifying transpose operations for specific data blocks transferred according to the source access pattern. Thus, by modifying the source access pattern and selectively transposing data blocks from the source memory component 210, the TTE 100 can modify the dimensions of the destination surface in comparison to the source surface.

7.3 Padding

In one implementation, the TTE 100 can selectively add padding along specified edges of the destination surface, at a specified depth, and of a specified type. More specifically, the TTE 100 can selectively generate data words indicating the appropriate padding values in accordance with the values stored in the destination counters and registers. In particular, the TTE 100 can: at a first time, load the target source data block from the current source address into a data buffer 170; at a second time, transfer the target source data block from the data buffer 170 to the current destination address; and append padding data to the target source data block in the data buffer 170.

For example, in response to reading particular values corresponding to edges of the destination surface (e.g., a destination stride counter value of zero indicating a contiguous block on the edge of the destination surface), the TTE 100 can substitute a data word representing a padding value instead of dequeuing a data word from the data buffer 170. Thus, the TTE 100 can add padding to the destination surface in order to further improve the efficiency of convolution operations of the processor system 200. In this implementation, the TTE 100 can execute multiple types of padding including zero padding, replication padding, and reflection padding.
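The substitution step can be illustrated with the following sketch of constant (e.g., zero) padding on a two-dimensional surface. It is a simplified model: edge detection here uses row and column indices rather than destination stride counter values, replication and reflection padding are not shown, and the name `write_with_padding` is hypothetical.

```python
from collections import deque


def write_with_padding(words, rows, cols, pad, pad_value=0):
    """Build a padded row-major destination surface from buffered source words."""
    data_buffer = deque(words)
    dst = []
    for r in range(rows + 2 * pad):
        for c in range(cols + 2 * pad):
            on_edge = (r < pad or r >= rows + pad or
                       c < pad or c >= cols + pad)
            # On an edge, substitute a padding word instead of dequeuing.
            dst.append(pad_value if on_edge else data_buffer.popleft())
    return dst


print(write_with_padding([1, 2, 3, 4], rows=2, cols=2, pad=1))
# [0, 0, 0, 0, 0, 1, 2, 0, 0, 3, 4, 0, 0, 0, 0, 0]
```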

8. Custom Pattern Variation

Generally, the TTE 100 can be configured to execute a custom data transfer operation (e.g., to transfer a non-strided and non-contiguous set of source data blocks) from a source memory component 210 to a destination memory component 220. In this variation, the TTE 100 can reference a source pointer array to identify the memory address and block counts for each source data block in the set of source data blocks. The TTE 100 can also include specific counters, address registers, and/or buffers to process this pointer array and access the reference memory addresses and source block lengths for each source data block in the set of source data blocks. The TTE 100 can then iterate through the source pointer array and transfer each contiguous source data block to the data buffer 170 and, concurrently or asynchronously, transfer each source data block to a series of destination blocks in the destination memory component 220. Thus, in addition to specific strided source access patterns, the TTE 100 can transfer any set of non-contiguous blocks from a source memory component 210 to a destination memory component 220 based on a reference to a source pointer array, thereby further improving the flexibility of the TTE 100 at the expense of only a few additional hardware components.

In this variation, the TTE 100 can write a control signal to the control signal register 110 that specifies a type of transfer operation (e.g., a strided data transfer operation or a custom data transfer operation). Additionally or alternatively, the TTE 100 can write a control signal to the control signal register 110 that separately specifies the source access pattern and the destination storage pattern, such that the TTE 100 can execute hybrid data transfer operations (e.g., by transferring a set of source data blocks arranged according to a custom source access pattern to a set of destination blocks arranged according to a strided destination access pattern or by transferring a set of source data blocks arranged according to a strided source access pattern to a set of destination blocks arranged according to a custom destination access pattern). Thus, a user or application may specify, via a control signal issued to the TTE 100, any combination of source access patterns and destination storage patterns for a data transfer operation between a source memory component 210 and a destination memory component 220.

Additionally, the TTE 100 can execute a custom data transfer operation for a subset of dimensions of an input surface while executing a strided data transfer operation for other dimensions of the input surface. Thus, the TTE 100 can execute a hybrid transfer operation for which the TTE 100 executes a strided access pattern in one dimension (and iterates through a transfer cycle to transfer strided source data blocks in this dimension), while iterating through a pointer array defining a custom source access pattern in a second dimension. Thus, a user or application of the TTE 100 can balance the advantages and disadvantages of the strided access pattern and the custom access pattern on a dimension-by-dimension basis.
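A hybrid pattern of this kind might be modeled as below, where a pointer array supplies arbitrary base addresses in one dimension and a stride length and stride count generate addresses in the other. The name `hybrid_addresses` and the choice of which dimension is custom are assumptions for illustration only.

```python
def hybrid_addresses(base_pointers, stride_length, stride_count):
    """Return source addresses for a custom outer dimension and a strided inner one."""
    addresses = []
    for base in base_pointers:         # custom dimension: iterate the pointer array
        address = base
        remaining = stride_count       # strided dimension: stride counter
        while remaining > 0:
            addresses.append(address)
            address += stride_length
            remaining -= 1
    return addresses


print(hybrid_addresses([40, 3, 17], stride_length=2, stride_count=3))
# [40, 42, 44, 3, 5, 7, 17, 19, 21]
```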

8.1 Custom Pattern Variation: Method

As shown in FIG. 4, a method S300 for executing a data transfer operation from a source memory component 210 to a destination memory component 220 includes: writing, to a control signal register 110, a control signal representing a custom source access pattern comprising a set of source data blocks in the source memory component 210, the control signal including a base pointer array address and an initial destination address in Block S310; accessing a pointer array at the base pointer array address, the pointer array comprising a set of pointer array elements, each pointer array element representing a source data block in the set of source data blocks and including a source address for the source data block and a source block count for the source data block in Block S320; and writing the initial destination address to a destination address register 130 in Block S330. The method S300 also includes, for each pointer array element in the set of pointer array elements: writing the source address for the source data block to a source address register 120 in Block S340; and writing the source block count for the source data block to a source block counter 122 in Block S342. The method S300 further includes, for each pointer array element in the set of pointer array elements and in response to a current source block count in the source block counter 122 representing at least one source data word remaining in the source data block: transferring a source data word stored at a current source address in the source address register 120 to a current destination address in the destination address register 130 in Block S350; incrementing the current source address in the source address register 120 in Block S360; incrementing the current destination address in the destination address register 130 in Block S370; and decrementing the current source block count in the source block counter 122 in Block S380.
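The blocks of the method S300 can be sketched in software as follows. This is a simplified model under stated assumptions: memories are Python lists, addresses are list indices, each pointer array element is a `(source_address, source_block_count)` tuple, and the function name `custom_transfer` is hypothetical.

```python
def custom_transfer(src_memory, dst_memory, pointer_array, initial_dst_address):
    """Copy non-contiguous source blocks to a contiguous destination region."""
    dst_address = initial_dst_address                  # Block S330
    for src_address, block_count in pointer_array:     # Blocks S340 and S342
        while block_count > 0:                         # words remain in the block
            dst_memory[dst_address] = src_memory[src_address]  # Block S350
            src_address += 1                           # Block S360
            dst_address += 1                           # Block S370
            block_count -= 1                           # Block S380
    return dst_memory


source = list(range(100, 120))
pointers = [(3, 2), (10, 3), (0, 1)]   # (source address, source block count)
print(custom_transfer(source, [0] * 8, pointers, 1))
# [0, 103, 104, 110, 111, 112, 100, 0]
```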

As shown in FIG. 5, one variation of the method S300 includes: writing, to a control signal register 110, a control signal representing a custom source access pattern comprising a set of source data blocks in the source memory component 210, representing a custom destination storage pattern comprising a set of destination blocks in the destination memory component 220, and including a base source pointer array address and a base destination pointer array address in Block S312; and accessing a source pointer array at the base source pointer array address, the source pointer array comprising a set of source pointer array elements, each source pointer array element representing a source data block in the set of source data blocks and including a source address for the source data block and a source block count for the source data block in Block S320. This variation of the method S300 also includes, for each source pointer array element in the set of source pointer array elements: writing the source address for the source data block represented by the source pointer array element to a source address register 120 in Block S340; writing the source block count for the source data block represented by the source pointer array element to a source block counter 122 in Block S342; and transferring the source data block at a current source address in the source address register 120 to a data buffer 170 based on a current source block count in the source block counter 122 in Block S352. This variation of the method S300 additionally includes accessing a destination pointer array at the base destination pointer array address, the destination pointer array comprising a set of destination pointer array elements, each destination pointer array element representing a destination block in the set of destination blocks and including a destination address for the destination block and a destination block count for the destination data block in Block S322. This variation of the method S300 further includes, for each destination pointer array element in the set of destination pointer array elements: writing the destination address for the destination block represented by the destination pointer array element to a destination address register 130 in Block S344; writing the destination block count for the destination data block represented by the destination pointer array element to a destination block counter in Block S346; and transferring a source data block stored in the data buffer 170 to a current destination address in the destination address register 130 based on a current destination block count in the destination block counter in Block S354.
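This variation, with separate source and destination pointer arrays, can be sketched as a gather into the data buffer followed by a scatter out of it. The sketch assumes list-based memories, `(address, block_count)` tuples for pointer array elements, and a hypothetical function name `gather_scatter`; the sequential ordering of the two phases is a simplification.

```python
from collections import deque


def gather_scatter(src_memory, dst_memory, src_pointers, dst_pointers):
    """Gather source blocks into a FIFO buffer, then scatter them to destination blocks."""
    data_buffer = deque()

    for address, count in src_pointers:        # Blocks S340 and S342
        while count > 0:                       # Block S352
            data_buffer.append(src_memory[address])
            address += 1
            count -= 1

    for address, count in dst_pointers:        # Blocks S344 and S346
        while count > 0 and data_buffer:       # Block S354
            dst_memory[address] = data_buffer.popleft()
            address += 1
            count -= 1
    return dst_memory


result = gather_scatter(list("abcdefgh"), ["."] * 6,
                        src_pointers=[(6, 2), (0, 1)],
                        dst_pointers=[(4, 1), (0, 2)])
print(result)  # ['h', 'a', '.', '.', 'g', '.']
```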

8.2 Custom Pattern Variation: System

As shown in FIGS. 6A and 6B, a tensor traversal engine (hereinafter "TTE 100") in a processor system 200 comprising a source memory component 210 and a destination memory component 220 includes: a control signal register 110 configured to store a control signal for a data transfer operation from the source memory component 210 to the destination memory component 220, the control signal: representing a custom source access pattern comprising a set of source data blocks in the source memory component 210; representing a custom destination storage pattern comprising a set of destination blocks in the destination memory component 220; and including a base source pointer array address and a base destination pointer array address. The TTE 100 also includes: a source address register 120; a source block counter 122; a destination address register 130; a destination block counter 132; and a data buffer 170. The TTE 100 further includes control logic 160 communicatively coupled to: the control signal register 110; the source address register 120; the source block counter 122; the destination address register 130; and the destination block counter 132.

8.3 Pointer Arrays

Generally, the custom pattern variation references a source and/or destination pointer array that defines a custom source access pattern and/or a custom destination access pattern, respectively. The processor system 200 can store a pointer array in a region of the source memory component 210, in a region of the destination memory component 220, or in a separate memory component of the processor system 200. The source pointer array and the destination pointer array include a set of pointer array elements, each pointer array element including a source address (for a source data block) or a destination address (for a destination block) as well as a block length (expressed as a number of data words) of the corresponding source or destination data block. Thus, by accessing a pointer array element in a source or destination pointer array, the TTE 100 can identify both the location (i.e., a source address or a destination address) and a size of each contiguous data block in the transfer pattern.

In one implementation, the TTE 100 can access a source or destination pointer array that stores relative source or destination addresses in order to compress the size of the source or destination pointer array. For example, the TTE 100 can access a source or destination pointer array including a source address defined relative to the initial source address of the pointer array or the base address of the pointer array itself.
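As a small illustration of relative addressing, the sketch below resolves pointer array elements stored as `(offset, block_length)` pairs against a base address. The element layout and the name `resolve_relative` are assumptions; the point is only that offsets can occupy fewer bits than absolute addresses.

```python
def resolve_relative(pointer_array, base_address):
    """Convert (offset, block_length) elements into (absolute_address, block_length)."""
    return [(base_address + offset, length) for offset, length in pointer_array]


print(resolve_relative([(0, 4), (16, 2)], base_address=0x1000))
# [(4096, 4), (4112, 2)]
```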

8.4 Pointer Array Queue

As shown in FIG. 6A, in one implementation of the custom pattern variation of the TTE 100, the TTE 100 can iterate through the source pointer array and the destination pointer array by loading these pointer arrays into a corresponding queue (e.g., within a memory device included in the TTE 100) and dequeuing the pointer array elements from these pointer arrays in order to iterate through the pointer arrays. This implementation enables the TTE 100 to fetch the entire pointer array in a single step, as opposed to multiple separate accesses, at the expense of increased hardware overhead.

More specifically, the TTE 100 can include a control signal register 110 configured to store a control signal: representing the custom source access pattern including the set of source data blocks in the source memory component 210; representing the custom destination storage pattern including the set of destination blocks in the destination memory component 220; and including the base source pointer array address, a source pointer array length, the base destination pointer array address, and a destination pointer array length. The TTE 100 can further include: a source pointer array queue 180 configured to store a set of source pointer array elements characterized by the source pointer array length; and a destination pointer array queue 181 configured to store a set of destination pointer array elements characterized by the destination pointer array length. Thus, in this implementation, the TTE 100 can access the pointer array at the base pointer array address by loading the pointer array into a pointer array queue 180 based on the base pointer array address and the pointer array length.

In order to execute the custom data transfer operation based on the pointer array queue 180, the TTE 100 can: read the source address for the source data block from a first pointer array element in the pointer array queue 180; write the source address for the source data block to the source address register 120; read the source block count for the source data block from the first pointer array element in the pointer array queue 180; write the source block count for the source data block to the source block counter 122; and, for each pointer array element in the set of pointer array elements, in response to writing the source address for the source data block to the source address register 120 and in response to writing the source block count for the source data block to the source block counter 122, dequeue the first pointer array element from the pointer array queue 180. Likewise, the TTE 100 can execute a similar series of steps for a destination pointer array queue 181.
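The queue-based iteration above can be modeled as in the following sketch, which covers the source side only. It assumes the pointer array is held in a list `pointer_memory` of `(source_address, source_block_count)` tuples; that list, the function name, and the use of a Python list as the data buffer are illustrative assumptions.

```python
from collections import deque


def run_with_pointer_queue(pointer_memory, base_pointer_address,
                           pointer_array_length, src_memory):
    """Load the whole pointer array into a queue, then drain it element by element."""
    # Single fetch of the entire pointer array into the pointer array queue.
    end = base_pointer_address + pointer_array_length
    pointer_queue = deque(pointer_memory[base_pointer_address:end])

    data_buffer = []
    while pointer_queue:
        # Read the head element into the source address register and block counter.
        src_address, block_count = pointer_queue[0]
        # Dequeue once both values have been written.
        pointer_queue.popleft()
        while block_count > 0:
            data_buffer.append(src_memory[src_address])
            src_address += 1
            block_count -= 1
    return data_buffer


pointer_memory = [(9, 9), (2, 2), (5, 1), (7, 7)]
print(run_with_pointer_queue(pointer_memory, 1, 2, list(range(10, 20))))
# [12, 13, 15]
```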

8.5 Pointer Array Address Register and Counter

As shown in FIG. 6B, the custom pattern version of the TTE 100 can utilize a pointer address register 190 and a pointer array counter 192 to track the progress of the TTE 100 as it iterates through the source or destination pointer arrays. In this implementation, the hardware overhead is reduced, as only a register and a counter are included for each of the source pointer array and the destination pointer array. Thus, the TTE 100 can iterate through the source pointer array and/or the destination pointer array by incrementing the source/destination pointer address register 190, 191 and decrementing the source/destination pointer array counter 192, 193.

More specifically, the TTE 100 can include a control signal register 110 configured to store a control signal: representing the custom source access pattern comprising the set of source data blocks in the source memory component 210; representing the custom destination storage pattern comprising the set of destination blocks in the destination memory component 220; and including the base source pointer array address, a source pointer array length, the base destination pointer array address, and a destination pointer array length. The TTE 100 can further include: a source pointer address register 190; a source pointer array counter 192; a destination pointer address register 191; and a destination pointer array counter 193.

Additionally, in this implementation, in order to iterate through the source and/or destination pointer array, the control logic 160 of the TTE 100 is configured to: write the base source pointer array address to the source pointer address register 190; write the source pointer array length to the source pointer array counter 192; write the base destination pointer array address to the destination pointer address register 191; and write the destination pointer array length to the destination pointer array counter 193. In this implementation of the TTE 100, the control logic 160 is also configured to, in response to a current source pointer array count in the source pointer array counter 192 representing at least one source pointer array element remaining in the source pointer array: read a current source pointer array address in the source pointer address register 190; read the source address for a source data block in the set of source data blocks from the source pointer array element at the current source pointer array address; write the source address for the source data block to the source address register 120; transfer the source data block at the source address for the source data block to the data buffer 170; increment the current source pointer array address in the source pointer address register 190; and decrement the current source pointer array count in the source pointer array counter 192. In this implementation, the TTE 100 includes control logic 160 additionally configured to, in response to a current destination pointer array count in the destination pointer array counter 193 representing at least one destination pointer array element remaining in the destination pointer array: read a current destination pointer array address in the destination pointer address register 191; read the destination address for a destination block in the set of destination blocks from the destination pointer array element at the current destination pointer array address; write the destination address for the destination block to the destination address register 130; transfer the source data block in the data buffer 170 to the destination address in the destination memory component 220; increment the current destination pointer array address in the destination pointer address register 191; and decrement the current destination pointer array count in the destination pointer array counter 193.

In further detail, the TTE 100 can write the source address for the source data block to the source address register 120 by: reading a current pointer array address in the pointer address register; reading the source address for the source data block from the pointer array element at the current pointer array address; and writing the source address for the source data block to the source address register 120. Additionally, the TTE 100 can write the source block count for the source data block to the source block counter 122 by: reading the current pointer array address in the pointer address register; reading the source block count for the source data block from the pointer array element at the current pointer array address; and writing the source block count for the source data block to the source block counter 122. The TTE 100 can then, for each pointer array element in the set of pointer array elements, in response to writing the source address for the source data block to the source address register 120 and in response to writing the source block count for the source data block to the source block counter 122, increment the current pointer array address in the pointer address register.

Thus, in this implementation, the TTE 100 can access the pointer array at the base pointer array address by writing the base pointer array address to a pointer address register and writing the pointer array length to a pointer array counter 192 prior to executing a while loop to repeatedly access consecutive pointer array elements from the pointer array.
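The register-and-counter iteration can be sketched as the while loop below, shown for the source side only. As with the other sketches, `pointer_memory` is an assumed list of `(source_address, source_block_count)` tuples indexed by pointer array address, and the function name is hypothetical.

```python
def run_with_pointer_register(pointer_memory, base_pointer_address,
                              pointer_array_length, src_memory):
    """Walk a pointer array using one address register and one counter."""
    pointer_address = base_pointer_address     # pointer address register
    pointer_count = pointer_array_length       # pointer array counter
    data_buffer = []

    while pointer_count > 0:                   # elements remain in the pointer array
        # One memory access per pointer array element.
        src_address, block_count = pointer_memory[pointer_address]
        while block_count > 0:
            data_buffer.append(src_memory[src_address])
            src_address += 1
            block_count -= 1
        pointer_address += 1                   # next pointer array element
        pointer_count -= 1
    return data_buffer


pointer_memory = [(9, 9), (2, 2), (5, 1), (7, 7)]
print(run_with_pointer_register(pointer_memory, 1, 2, list(range(10, 20))))
# [12, 13, 15]
```

Compared with a queue-based approach, this model holds only two values of pointer state at a time but reads the pointer array once per element rather than in a single fetch.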

The systems and methods described herein can be embodied and/or implemented at least in part as a machine configured to receive a computer-readable medium storing computer-readable instructions. The instructions can be executed by computer-executable components integrated with the application, applet, host, server, network, website, communication service, communication interface, hardware/firmware/software elements of a user computer or mobile device, wristband, smartphone, or any suitable combination thereof. Other systems and methods of the embodiment can be embodied and/or implemented at least in part as a machine configured to receive a computer-readable medium storing computer-readable instructions. The instructions can be executed by computer-executable components integrated with apparatuses and networks of the type described above. The computer-readable medium can be stored on any suitable computer readable media such as RAMs, ROMs, flash memory, EEPROMs, optical devices (CD or DVD), hard drives, floppy drives, or any suitable device. The computer-executable component can be a processor, but any suitable dedicated hardware device can (alternatively or additionally) execute the instructions.

As a person skilled in the art will recognize from the previous detailed description and from the figures and claims, modifications and changes can be made to the embodiments of the invention without departing from the scope of this invention as defined in the following claims.

I claim:
 1. A method for executing a data transfer operation from a source memory component to a destination memory component, the method comprising: accessing a pointer array comprising a first pointer array element representing a first source data block in the source memory component, the first pointer array element comprising: a first source address for the first source data block; and a first source block length for the first source data block; writing the first source address to a source address register as a current source address; writing the first source block length to a source block counter as a current source block count; transferring a first source data word stored at the current source address in the source address register to a current destination address in a destination address register; incrementing the current source address in the source address register; and decrementing the current source block count in the source block counter.
 2. The method of claim 1, further comprising: accessing a control signal in a control signal register, the control signal comprising an initial destination address; writing the initial destination address to the destination address register as the current destination address; and incrementing the current destination address in the destination address register.
 3. The method of claim 1, further comprising: accessing a control signal in a control signal register, the control signal comprising: a base pointer array address; and a pointer array length; and loading the pointer array into a pointer array queue based on the base pointer address and the pointer array length.
 4. The method of claim 3, further comprising: reading the first source address from the first pointer array element in the pointer array queue; reading the first source block length from the first pointer array element in the pointer array queue; and dequeuing the first pointer array element from the pointer array queue.
 5. The method of claim 1, further comprising: writing a base pointer array address to a pointer address register as a current pointer array address; and for each pointer array element in a set of pointer array elements in the pointer array: reading a current pointer array address in the pointer address register; reading a respective source address for a respective source data block from a respective pointer array element at the current pointer array address; writing the respective source address to the source address register as the current source address; reading a respective source block length for the respective source data block from the pointer array element at the current pointer array address; writing the respective source block length for the source data block to the source block counter as the current source block count; and incrementing the current pointer array address in the pointer address register.
 6. The method of claim 1, wherein transferring the first source data word stored at the current source address in the source address register to the current destination address in a destination address register comprises: loading the first source data word from the current source address into a transpose buffer according to a first buffer dimension of the transpose buffer; and transferring the first source data word from the transpose buffer according to a second buffer dimension of the transpose buffer.
 7. The method of claim 1, wherein transferring the first source data word stored at the current source address in the source address register to the current destination address in a destination address register comprises: loading the first source data word from the current source address into a data buffer; and transferring the first source data word from the data buffer to the current destination address.
 8. The method of claim 1, further comprising: accessing a control signal in a control signal register, the control signal: representing a source access pattern in the source memory component, the source access pattern defining a first dimension; and comprising a first source stride length in the first dimension; and advancing the current source address in the source address register based on the first source stride length and the current source address.
 9. The method of claim 1, further comprising: accessing a control signal in a control signal register, the control signal: representing a destination storage pattern in the destination memory component, the destination storage pattern defining a first dimension; and comprising a first destination stride length in the first dimension; and advancing the current destination address in the destination address register based on the first destination stride length and the current destination address.
 10. A method for executing a data transfer operation, the method comprising: accessing a source pointer array comprising a first source pointer array element representing a first source data block in a source memory component, the first source pointer array element comprising: a first source address for the first source data block; and a first source block length for the first source data block; writing the first source address to a source address register as a current source address; writing the first source block length to a source block counter as a current source block count; enqueuing a first source data word stored at a current source address in the source address register to a data buffer; incrementing the current source address in the source address register; and decrementing the current source block count in the source block counter.
 11. The method of claim 10, further comprising: accessing a destination pointer array comprising a first destination pointer array element representing a first destination block in a destination memory component, the first destination pointer array element comprising: a first destination address for the first destination block; and a first destination block length for the first destination data block; writing the first destination address to a destination address register as a current destination address; writing the first destination block length to a destination block counter as a current destination block count; dequeuing the first source data word stored in the data buffer to the current destination address in the destination address register; incrementing the current destination address in the destination address register; and decrementing the destination block count in the destination block counter.
 12. The method of claim 11: further comprising accessing a control signal in a control signal register, the control signal: representing a custom destination storage pattern comprising the first destination block; and comprising a base destination pointer array address; and wherein accessing a destination pointer array comprises accessing the destination pointer array at the base destination pointer array address.
 13. The method of claim 10: further comprising accessing a control signal in a control signal register, the control signal: representing a custom source access pattern comprising the first source data block; and comprising a base source pointer array address; and wherein accessing a source pointer array comprises accessing the source pointer array at the base source pointer array address.
 14. A tensor traversal engine in a processor system comprising a source memory component and a destination memory component, the tensor traversal engine comprising: a source pointer address register; a source pointer array counter; a data buffer; and control logic configured to: read a current source pointer array address in the source pointer address register; read a source address for a source data block in the source memory component from a source pointer array element at the current source pointer array address; transfer the source data block at the source address to the data buffer; increment the current source pointer array address in the source pointer address register; and decrement a current source pointer array count in the source pointer array counter.
 15. The tensor traversal engine of claim 14: further comprising a destination pointer address register; further comprising a destination pointer array counter; and wherein the control logic is further configured to: read a current destination pointer array address in the destination pointer address register; read a destination address for a destination block in the destination memory component from a destination pointer array element at the current destination pointer array address; transfer the source data block in the data buffer to the destination address; increment the current destination pointer array address in the destination pointer address register; and decrement a current destination pointer array count in the destination pointer array counter.
 16. The tensor traversal engine of claim 15: further comprising a control signal register configured to store a control signal for a data transfer operation from the source memory component to the destination memory component, the control signal comprising: a base destination pointer array address; and a destination pointer array length; and wherein the control logic is further configured to: write the base destination pointer array address to the destination pointer address register; and write the destination pointer array length to the destination pointer array counter.
 17. The tensor traversal engine of claim 14: further comprising a control signal register configured to store a control signal for a data transfer operation from the source memory component to the destination memory component, the control signal comprising: a base source pointer array address; and a source pointer array length; and wherein the control logic is further configured to: write the base source pointer array address to the source pointer address register; and write the source pointer array length to the source pointer array counter.
 18. The tensor traversal engine of claim 14, further comprising: a control signal register configured to store a control signal for a strided data transfer operation from the source memory component to the destination memory component, the control signal comprising a destination stride count; and a destination stride counter configured to store the destination stride count.
 19. The tensor traversal engine of claim 14, further comprising: a control signal register configured to store a control signal for a strided data transfer operation from the source memory component to the destination memory component, the control signal comprising a source stride count; a source stride counter configured to store the source stride count.
 20. The tensor traversal engine of claim 14, further comprising: a control signal register configured to store a control signal for a data transfer operation from the source memory component to the destination memory component, the control signal comprising: a source pointer array length; and a destination pointer array length; a source pointer array queue configured to store a set of source pointer array elements characterized by the source pointer array length; and a destination pointer array queue configured to store a set of destination pointer array elements characterized by the destination pointer array length.
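
By way of illustration only, the pointer-array transfer recited in claims 14 through 17 can be sketched as a small software model. The sketch below is not the claimed hardware and does not limit the claims. It makes several assumptions that do not appear in the claims: the source and destination memory components and the pointer arrays are modeled as Python lists, every data block has a fixed length `block_size`, and the names `PointerTraversalModel`, `load_control_signal`, `step`, and `run` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PointerTraversalModel:
    """Toy software model of a pointer-array transfer (illustrative only)."""

    src_memory: list          # stands in for the source memory component
    dst_memory: list          # stands in for the destination memory component
    src_pointer_array: list   # each element holds a source address
    dst_pointer_array: list   # each element holds a destination address
    block_size: int = 1       # assumption: fixed-size data blocks

    # Registers and counters named after the claim elements.
    src_ptr_addr: int = 0     # source pointer address register
    src_ptr_count: int = 0    # source pointer array counter
    dst_ptr_addr: int = 0     # destination pointer address register
    dst_ptr_count: int = 0    # destination pointer array counter
    data_buffer: list = field(default_factory=list)

    def load_control_signal(self, base_src, src_len, base_dst, dst_len):
        # Claims 16 and 17: write base addresses and lengths into the
        # pointer address registers and pointer array counters.
        self.src_ptr_addr, self.src_ptr_count = base_src, src_len
        self.dst_ptr_addr, self.dst_ptr_count = base_dst, dst_len

    def step(self):
        # Claim 14: read the source address from the current pointer array
        # element, move the block into the data buffer, then advance.
        src_addr = self.src_pointer_array[self.src_ptr_addr]
        self.data_buffer = self.src_memory[src_addr:src_addr + self.block_size]
        self.src_ptr_addr += 1
        self.src_ptr_count -= 1

        # Claim 15: read the destination address, write the buffered block
        # to it, then advance the destination register and counter.
        dst_addr = self.dst_pointer_array[self.dst_ptr_addr]
        self.dst_memory[dst_addr:dst_addr + self.block_size] = self.data_buffer
        self.dst_ptr_addr += 1
        self.dst_ptr_count -= 1

    def run(self):
        # Assumption: the transfer ends when either counter reaches zero.
        while self.src_ptr_count > 0 and self.dst_ptr_count > 0:
            self.step()
        return self.dst_memory


def demo():
    model = PointerTraversalModel(
        src_memory=list(range(10, 20)),
        dst_memory=[0] * 8,
        src_pointer_array=[6, 0, 2],
        dst_pointer_array=[0, 4, 2],
        block_size=2,
    )
    model.load_control_signal(base_src=0, src_len=3, base_dst=0, dst_len=3)
    return model.run()


if __name__ == "__main__":
    print(demo())  # [16, 17, 12, 13, 10, 11, 0, 0]
```

In this model, each step gathers one block from an arbitrary source address and scatters it to an arbitrary destination address, so the access pattern is defined entirely by the contents of the two pointer arrays rather than by a stride.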